Abstract

To preserve client privacy in the data mining process, 
a variety of techniques based on random perturbation of 
data records have been proposed recently. In this paper, 
we present a generalized matrix-theoretic model of random 
perturbation, which facilitates a systematic approach to the 
design of perturbation mechanisms for privacy-preserving 
mining. Specifically, we demonstrate that (a) the prior tech- 
niques differ only in their settings for the model parame- 
ters, and (b) through appropriate choice of parameter set- 
tings, we can derive new perturbation techniques that pro- 
vide highly accurate mining results even under strict pri- 
vacy guarantees. We also propose a novel perturbation 
mechanism wherein the model parameters are themselves 
characterized as random variables, and demonstrate that 
this feature provides significant improvements in privacy at 
a very marginal cost in accuracy. 

While our model is valid for random-perturbation-based 
privacy-preserving mining in general, we specifically eval- 
uate its utility here with regard to frequent-itemset mining 
on a variety of real datasets. The experimental results indi- 
cate that our mechanisms incur substantially lower identity 
and support errors as compared to the prior techniques. 



1. Introduction

The knowledge models produced through data mining techniques are only
as good as the accuracy of their input data. One source of data
inaccuracy is when users, due to privacy concerns, deliberately provide
wrong information. This is especially common with regard to customers
asked to provide personal information on web forms to e-commerce
service providers.

The compulsion for doing so may be the (perhaps well-founded) worry
that the requested information may be misused by the service provider
to harass the customer. As a case in point, consider a pharmaceutical
company that asks clients to disclose the diseases they have suffered
from in order to investigate the correlations in their occurrences - for
example, "Adult females with malarial infections are also prone to
contract tuberculosis". While the company may be acquiring the data
solely for genuine data mining purposes that would eventually reflect
itself in better service to the client, at the same time the client
might worry that if her medical records are either inadvertently or
deliberately disclosed, it may adversely affect her employment
opportunities.

To encourage users to submit correct inputs, a variety of
privacy-preserving data mining techniques have been proposed in the
last few years. The goal of these techniques is to keep the raw local
data private but, at the same time, support accurate reconstruction of
the global data mining models. Most of the techniques are based on a
data perturbation approach, wherein the user data is distorted in a
probabilistic manner that is disclosed to the eventual miner. For
example, in the MASK technique [18], intended for privacy-preserving
association-rule mining on sparse boolean databases, each 0 or 1 in the
original user transaction vector is flipped with a parametrized
probability 1 - p.

1.1. The FRAPP Framework



The trend in the prior literature has been to propose specific
perturbation techniques, which are then analyzed for their privacy and
accuracy properties. In this paper, we present FRAPP (FRamework for
Accuracy in Privacy-Preserving mining), a generalized matrix-theoretic
framework that facilitates a systematic approach to the design of
random perturbation schemes for privacy-preserving mining. While
various privacy metrics have been discussed in the literature, FRAPP
supports a particularly strong notion of privacy, originally proposed
in [13]. Specifically, it supports a measure called "amplification",
which guarantees strict limits on privacy breaches of individual user
information, independent of the distribution of the original data.

FRAPP quantitatively characterizes the sources of error 
in random data perturbation and model reconstruction pro- 
cesses. We first demonstrate that the prior techniques differ 
only in their settings for the FRAPP parameters. Further, 
and more importantly, we show that through appropriate 
choice of parameter settings, new perturbation techniques 
can be constructed that provide highly accurate mining re- 
sults even under strict privacy guarantees. Efficient imple- 
mentations for these new perturbation techniques are also 
presented. 

We investigate here, for the first time, the possibility of 
randomizing the perturbation parameters themselves. The 
motivation is that it could lead to an increase in privacy 
levels since the exact parameter values used by a specific 
client will not be known to the data miner. This scheme has 
the obvious downside of perhaps reducing the model recon- 
struction accuracy. However, our investigation shows that 
the tradeoff is very attractive in that the privacy increase is 
substantial whereas the accuracy reduction is only marginal. 
This opens up the possibility of using FRAPP in a two-step 
process: First, given a user-desired level of privacy, iden- 
tifying the deterministic values of the FRAPP parameters 
that both guarantee this privacy and also maximize the ac- 
curacy; and then, (optionally) randomizing these parame- 
ters to obtain even better privacy guarantees at a minimal 
cost in accuracy. 

The FRAPP model is valid for random-perturbation- 
based privacy-preserving mining in general. Here, we fo- 
cus on its applications to categorical databases, where the 
domain of each attribute is finite. Note that boolean data 
is a special case of this class, and further, that continuous- 
valued attributes can be converted into categorical attributes 
by partitioning the domain of the attribute into fixed length 
intervals. 

To quantitatively evaluate FRAPP's utility, we specif- 
ically evaluate the performance of our new perturbation 
mechanisms on the popular mining task of finding frequent 
itemsets, the cornerstone of association rule mining. Our 
evaluation on a variety of real datasets shows that both iden- 
tity and support errors are substantially lower than those in- 
curred by the prior privacy-preserving techniques. 

1.2. Contributions 

In a nutshell, FRAPP provides a mathematical foundation for "raising
both the accuracy and privacy bars in strict privacy-preserving
mining". Specifically, our main contributions are as follows:

• FRAPP, a generalized matrix-theoretic framework for 
random perturbation and mining model reconstruction; 

• Using FRAPP to derive new perturbation mechanisms 
for minimizing the model reconstruction error while 
ensuring strict privacy guarantees; 

• Introducing the concept of randomization of perturba- 
tion parameters, and thereby deriving enhanced pri- 
vacy; 

• Efficient implementations of the perturbation tech- 
niques for the proposed mechanisms; 

• Quantitatively demonstrating the utility of our schemes 
in the context of association rule mining. 

1.3. Organization 

The remainder of this paper is organized as follows: The FRAPP
framework for data perturbation and model reconstruction is presented
in Section 2. Appropriate choices of FRAPP parameters for
simultaneously guaranteeing strict data privacy and providing high
model accuracy are discussed in Section 3. The impact of randomizing
the FRAPP parameters is investigated in Section 4. Efficient schemes
for implementing the new perturbation mechanisms are described in
Section 5. In Section 6, we discuss the application of our mechanisms
to association rule mining. Then, in Section 7, the utility of FRAPP in
the context of association rule mining is quantitatively investigated.
Related work on privacy-preserving mining is reviewed in Section 8.
Finally, in Section 9, we summarize the conclusions of our study and
outline future research avenues.

2. The FRAPP Framework 

In this section, we describe the construction of the 
FRAPP framework, and its quantification of privacy and ac- 
curacy measures. 

Data Model. We assume that the original database U consists of N
records, with each record having M categorical attributes. The domain
of attribute j is denoted by $S_U^j$, resulting in the domain $S_U$ of
a record in U being given by $S_U = \prod_{j=1}^{M} S_U^j$. We map the
domain $S_U$ to the index set $I_U = \{1, \ldots, |S_U|\}$, so that we
can model the database as a set of N values from $I_U$. Thus, if we
denote the i-th record of U as $U_i$, we have

$$U = \{U_i\}_{i=1}^{N}, \quad U_i \in I_U$$
Perturbation Model. We consider the privacy situation wherein the
customers trust no one except themselves, that is, they wish to perturb
their records at their client site before the information is sent to
the miner, or any intermediate party. This means that perturbation is
done at the level of individual customer records $U_i$, without being
influenced by the contents of the other records in the database.

For this situation, there are two possibilities: a simple independent
column perturbation, wherein the value of each attribute in the record
is perturbed independently of the rest, or a more generalized dependent
column perturbation, where the perturbation of each column may be
affected by the perturbations of the other columns in the record. Most
of the prior perturbation techniques, including [12, 13, 18], fall into
the independent column perturbation category. The FRAPP framework,
however, includes both kinds of perturbation in its analysis.

Let the perturbed database be $V = \{V_1, \ldots, V_N\}$, with domain
$S_V$ and corresponding index set $I_V$. For each original customer
record $U_i = u$, $u \in I_U$, a new perturbed record $V_i = v$,
$v \in I_V$, is randomly generated with probability $p[u \to v]$. Let A
denote the matrix of these transition probabilities, with
$A_{vu} = p[u \to v]$. This random process maps to a Markov process,
and the perturbation matrix A should therefore satisfy the following
properties [22]:

$$\sum_{v \in I_V} A_{vu} = 1 \quad \forall u \in I_U, \qquad A_{vu} \ge 0 \quad \forall u \in I_U,\ v \in I_V \tag{1}$$

Due to the constraints imposed by Equation 1, the domain of A is not
$\mathbb{R}^{|S_U| \times |S_V|}$ but a subset of it. This domain is
further restricted by the choice of perturbation method. For example,
for the MASK technique [18] mentioned in the Introduction, all the
entries of matrix A are decided by the choice of the single parameter
p.

In this paper, we propose to explore the preferred choices 
of A to simultaneously achieve privacy guarantees and high 
accuracy, without restricting ourselves ab initio to a partic- 
ular perturbation method. 

2.1. Privacy Guarantees 

The miner receives the perturbed database V and at- 
tempts to reconstruct the original probability distribution of 
database U using this perturbed data and the knowledge of 
the perturbation matrix A. 

The prior probability of a property of a customer's private
information is the likelihood of the property in the absence of any
knowledge about the customer's private information. On the other hand,
the posterior probability is the likelihood of the property given the
perturbed information from the customer and the knowledge of the prior
probabilities through reconstruction from the perturbed database. As
discussed in [13], in order to preserve the privacy of some property of
a customer's private information, we desire that the posterior
probability of that property should not be much higher than the prior
probability of the property for the customer. This is quantified by
saying that a perturbation method has privacy guarantees
$(\rho_1, \rho_2)$ if, for any property $Q(U_i)$ with prior probability
less than $\rho_1$, the posterior probability of the property is
guaranteed to be less than $\rho_2$.

For our formulation, we derive (using Definition 3 and Statement 1
from [13]) the following condition on the perturbation matrix A in
order to support $(\rho_1, \rho_2)$ privacy:

$$\frac{A_{vu_1}}{A_{vu_2}} \le \gamma \le \frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)} \quad \forall u_1, u_2 \in I_U,\ \forall v \in I_V \tag{2}$$

That is, the choice of perturbation matrix A should follow the
restriction that the ratio of any two entries in the same row should
not be more than $\gamma$.
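As a rough illustration of Equation 2, the sketch below computes the bound on $\gamma$ implied by a $(\rho_1, \rho_2)$ requirement and checks whether a given matrix respects it. The function names and the 2 x 2 single-attribute flip matrix are illustrative assumptions, not part of FRAPP itself.

```python
def gamma_bound(rho1, rho2):
    """Upper limit on gamma from Equation 2 for (rho1, rho2) privacy."""
    return (rho2 * (1.0 - rho1)) / (rho1 * (1.0 - rho2))


def amplification(A):
    """Largest ratio between two entries in the same row of A (rows index v)."""
    worst = 1.0
    for row in A:
        lo, hi = min(row), max(row)
        if lo == 0.0:
            return float("inf")  # a zero entry makes the ratio unbounded
        worst = max(worst, hi / lo)
    return worst


def satisfies_privacy(A, rho1, rho2):
    """True if every same-row ratio stays within the Equation 2 bound."""
    return amplification(A) <= gamma_bound(rho1, rho2) + 1e-12


# Hypothetical example: one boolean attribute flipped with probability 1 - p.
p = 0.9
A_flip = [[p, 1 - p],
          [1 - p, p]]

gamma = gamma_bound(0.05, 0.50)   # bound for rho1 = 5%, rho2 = 50%
print(gamma, amplification(A_flip), satisfies_privacy(A_flip, 0.05, 0.50))
```

With these assumed values the flip matrix has a same-row ratio of about 9, which stays within the bound; raising p shrinks the flipping probability and eventually violates it.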

2.2. Reconstruction Model 

We now analyze how the distribution of the original 
database can be reconstructed from the perturbed database. 
As per the perturbation model, a client $C_i$ with data record
$U_i = u$, $u \in I_U$, generates record $V_i = v$, $v \in I_V$, with
probability $p[u \to v]$. This event of generation of v can be viewed
as a Bernoulli trial with success probability $p[u \to v]$. If we
denote the outcome of the i-th Bernoulli trial by the random variable
$Y_v^i$, then the total number of successes $Y_v$ in N trials is given
by the sum of the N Bernoulli random variables, i.e.

$$Y_v = \sum_{i=1}^{N} Y_v^i \tag{3}$$

That is, the total number of records with value v in the perturbed
database will be given by the total number of successes $Y_v$.

Note that $Y_v$ is the sum of N independent but non-identical Bernoulli
trials. The trials are non-identical because the probability of success
in a trial i varies from another trial j and actually depends on the
values of $U_i$ and $U_j$, respectively. The distribution of such a
random variable $Y_v$ is known as the Poisson-Binomial distribution
[25].

Now, from Equation 3, the expectation of $Y_v$ is given by

$$E(Y_v) = \sum_{i=1}^{N} E(Y_v^i) = \sum_{i=1}^{N} P(Y_v^i = 1) \tag{4}$$

Let $X_u$ denote the number of records with value u in the original
database. Since $P(Y_v^i = 1) = p[u \to v] = A_{vu}$ for $U_i = u$, we
get

$$E(Y_v) = \sum_{u \in I_U} A_{vu} X_u \tag{5}$$

Let $X = [X_1\ X_2\ \cdots\ X_{|S_U|}]^T$ and
$Y = [Y_1\ Y_2\ \cdots\ Y_{|S_V|}]^T$; then from Equation 5 we get

$$E(Y) = AX \tag{6}$$

We estimate X as $\hat{X}$ given by the solution of the following
equation

$$Y = A\hat{X} \tag{7}$$

which is an approximation to Equation 6. This is a system of $|S_V|$
equations in $|S_U|$ unknowns. For the system to be uniquely solvable,
a necessary condition is that the space of the perturbed database is
larger than or equal to the original database (i.e.
$|S_V| \ge |S_U|$). Further, if the inverse of matrix A exists, then we
can find the solution of the above system of equations by

$$\hat{X} = A^{-1} Y \tag{8}$$

That is, Equation 8 gives the estimate of the distribution of records
in the original database, which is the objective of the reconstruction
exercise.
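The perturbation and reconstruction steps can be sketched end to end as follows. This is a minimal illustration under assumed values: the 3 x 3 matrix, the counts in `X`, and the helper names `solve` and `perturb` are all hypothetical, and a small Gaussian elimination stands in for computing $A^{-1} Y$.

```python
import random


def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [y[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def perturb(u, A, rng):
    """Draw a perturbed value v with probability A[v][u] (column u of A)."""
    column = [A[v][u] for v in range(len(A))]
    return rng.choices(range(len(A)), weights=column)[0]


rng = random.Random(0)
A = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]          # columns sum to one, as Equation 1 requires
X = [6000, 3000, 1000]         # assumed true counts X_u (0-based values)

Y = [0, 0, 0]                  # observed counts Y_v in the perturbed database
for u, count in enumerate(X):
    for _ in range(count):
        Y[perturb(u, A, rng)] += 1

X_hat = solve(A, Y)            # Equation 8: X_hat = A^{-1} Y
print(Y, [round(t) for t in X_hat])
```

The estimate differs from `X` only through the random deviation of `Y` from its mean `AX`, which is the error source analyzed next.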

2.3. Estimation Error 

To analyze the error in the above estimation process, we use the
following well-known theorem from linear algebra [22]:

Theorem 1: For an equation of the form $Ax = b$, the relative error in
the solution $x = A^{-1} b$ satisfies

$$\frac{\|\delta x\|}{\|x\|} \le c\, \frac{\|\delta b\|}{\|b\|}$$

where c is the condition number of matrix A. For a positive definite
matrix, $c = \lambda_{max} / \lambda_{min}$, where $\lambda_{max}$ and
$\lambda_{min}$ are the maximum and minimum eigenvalues of the n x n
matrix A. Informally, the condition number is a measure of stability or
sensitivity of a matrix to numerical operations. Matrices with
condition numbers near one are said to be well-conditioned, whereas
those with condition numbers much greater than one (e.g. $10^5$ for a
5 x 5 Hilbert matrix [22]) are said to be ill-conditioned.
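To make Theorem 1 concrete, the following sketch compares a well-conditioned and an ill-conditioned 2 x 2 symmetric positive definite matrix. Both matrices, the right-hand side, and the perturbation `delta` are arbitrary illustrative choices; `delta` is deliberately aligned with the smallest eigenvalue so that the bound is attained.

```python
import math


def cond_sym2(a, b, d):
    """lambda_max / lambda_min of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return (mean + radius) / (mean - radius)


def solve2(a, b, d, rhs):
    """Solve [[a, b], [b, d]] x = rhs by Cramer's rule."""
    det = a * d - b * b
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - b * rhs[0]) / det]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


def relative_errors(a, b, d, rhs, delta):
    """Return (relative error in x, relative error in b) for a perturbed rhs."""
    x = solve2(a, b, d, rhs)
    x_pert = solve2(a, b, d, [rhs[0] + delta[0], rhs[1] + delta[1]])
    dx = [x_pert[0] - x[0], x_pert[1] - x[1]]
    return norm(dx) / norm(x), norm(delta) / norm(rhs)


well = (0.9, 0.1, 0.9)      # eigenvalues 1.0 and 0.8
ill = (0.51, 0.49, 0.51)    # eigenvalues 1.0 and 0.02
rhs, delta = [1.0, 1.0], [0.01, -0.01]

for name, (a, b, d) in (("well", well), ("ill", ill)):
    c = cond_sym2(a, b, d)
    err_x, err_b = relative_errors(a, b, d, rhs, delta)
    print(f"{name}: c = {c:.2f}, error in x = {err_x:.4f}, bound = {c * err_b:.4f}")
```

The same 1% disturbance in b produces a far larger relative error in x for the ill-conditioned matrix, which is why the condition number of A matters for reconstruction.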

From Equations 6 and 8 and the above theorem, we have

$$\frac{\|\hat{X} - X\|}{\|X\|} \le c\, \frac{\|Y - E(Y)\|}{\|E(Y)\|} \tag{9}$$

This inequality means that the error in estimation arises from two
sources: first, the sensitivity of the problem, which is measured by
the condition number of matrix A; and, second, the deviation of Y from
its mean, as measured by the variance of Y.

As discussed above, $Y_v$ is a Poisson-Binomial distributed random
variable. Hence, using the expression for the variance of a
Poisson-Binomial random variable [25], we can compute the variance of
$Y_v$ to be

$$Var(Y_v) = A_v X \left(1 - \frac{1}{N} A_v X\right) - \sum_{u \in I_U} \left(A_{vu} - \frac{1}{N} A_v X\right)^2 X_u \tag{10}$$

where $A_v$ denotes the v-th row of A. The variance depends on the
perturbation matrix A and the distribution X of records in the original
database. Thus the effectiveness of the privacy preserving method is
critically dependent on the choice of matrix A.
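As a sanity check on Equation 10, the sketch below evaluates it for one row of a perturbation matrix and compares the result with the direct sum of per-trial Bernoulli variances $p(1 - p)$. The row `A_row` and the counts `X` are assumed values chosen only for illustration.

```python
def variance_eq10(A_row, X):
    """Var(Y_v) from Equation 10; A_row is row v of A, X holds the counts X_u."""
    N = sum(X)
    mean_count = sum(a * x for a, x in zip(A_row, X))   # A_v X = E(Y_v)
    p_bar = mean_count / N                               # (1/N) A_v X
    spread = sum((a - p_bar) ** 2 * x for a, x in zip(A_row, X))
    return mean_count * (1.0 - p_bar) - spread


def variance_direct(A_row, X):
    """Sum of the individual Bernoulli variances p (1 - p) over all N trials."""
    return sum(x * a * (1.0 - a) for a, x in zip(A_row, X))


A_row = [0.8, 0.1, 0.1]        # assumed row v of a perturbation matrix
X = [6000, 3000, 1000]         # assumed original counts

v10 = variance_eq10(A_row, X)
vdir = variance_direct(A_row, X)
print(v10, vdir)
```

Both routes give the same value, since Equation 10 is an algebraic rearrangement of the sum of independent Bernoulli variances.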

3. Choice of Perturbation Matrix 

The various perturbation techniques proposed in the literature
primarily differ in their choice for perturbation matrix A. For
example,

• MASK [18] uses the matrix A with

$$A_{vu} = p^{k} (1 - p)^{M_b - k} \tag{11}$$

where k is the number of attributes with matching values in perturbed
value v and original value u, $M_b$ is the number of boolean attributes
when each categorical attribute j is converted into $|S_U^j|$ boolean
attributes, and 1 - p is the value flipping probability.

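• The cut-and-paste randomization operator [12] employs a matrix A with

$$A_{vu} = \sum_{z=0}^{M} p_M[z] \sum_{q=\max\{0,\ z + l_u - M,\ l_u + l_v - M_b\}}^{\min\{z,\ l_u,\ l_v\}} \frac{\binom{l_u}{q} \binom{M - l_u}{z - q}}{\binom{M}{z}} \binom{M_b - l_u}{l_v - q} \rho^{(l_v - q)} (1 - \rho)^{(M_b - l_u - l_v + q)} \tag{12}$$

where

$$p_M[z] = \sum_{w=0}^{\min\{K,\ z\}} \binom{M - w}{z - w} \rho^{(z - w)} (1 - \rho)^{(M - z)} \cdot \begin{cases} 1 - M/(K + 1), & \text{if } w = M \text{ and } w < K \\ 1/(K + 1), & \text{o.w.} \end{cases}$$

Here $l_u$ and $l_v$ are the number of 1s in the original record u and
its corresponding perturbed record v, respectively, while K and $\rho$
are operator parameters.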

For enforcing strict privacy guarantees, the parameters for the above
methods are decided by the constraints on the values of perturbation
matrix A given in Equation 2. It turns out that for practical values of
privacy requirements, the resulting matrix A for these schemes is
extremely ill-conditioned - in fact, we found the condition numbers in
our experiments to be of the order of 10^ and 10^ for MASK and the
Cut-and-Paste operator, respectively.

Such ill-conditioned matrices make the reconstruction very sensitive to
the variance in the distribution of the perturbed database. Thus, it is
important to carefully choose the matrix A such that it is
well-conditioned (i.e., has a low condition number). If we decide on a
distortion method a priori, as in the prior techniques, then there is
little room for making specific choices of perturbation matrix A.
Therefore, we take the opposite approach of first designing matrices of
the required type, and then devising perturbation methods that are
compatible with the chosen matrices.

To choose the appropriate matrix, we start from the intuition that for
$\gamma = \infty$, the matrix choice would be the unity (identity)
matrix, which satisfies the constraints on matrix A imposed by
Equations 1 and 2 and has condition number 1. Hence, for a given
$\gamma$, we can choose the following matrix:

$$A_{ij} = \begin{cases} \gamma x, & \text{if } i = j \\ x, & \text{o.w.} \end{cases} \tag{13}$$

where

$$x = \frac{1}{\gamma + (|S_U| - 1)}$$

This matrix will be of the form

$$x \begin{bmatrix} \gamma & 1 & 1 & \cdots \\ 1 & \gamma & 1 & \cdots \\ 1 & 1 & \gamma & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

It is easy to see that the above matrix, which incidentally is a
symmetric Toeplitz matrix [22], satisfies the conditions given by
Equations 1 and 2. Further, its condition number can be computed to be
$1 + \frac{|S_U|}{\gamma - 1}$. For ease of exposition, we will
hereafter refer to this matrix informally as the "gamma-diagonal
matrix".

At this point, an obvious question is whether it is possible to design
matrices that have even lower condition number than the gamma-diagonal
matrix. In the remainder of this section, we prove that within the
constraints of our problem, the gamma-diagonal matrix has the lowest
possible condition number, that is, it is an optimal choice (albeit
non-unique).

Proof. To prove this, we will first derive the expression for the
minimum condition number for such matrices and the conditions under
which that condition number is achieved. Then we show that our
gamma-diagonal matrix satisfies these conditions, and has minimum
condition number.

For a symmetric positive definite matrix, the condition number is given
by

$$c = \frac{\lambda_{max}}{\lambda_{min}} \tag{14}$$

where $\lambda_{max}$ and $\lambda_{min}$ are the maximum and minimum
eigenvalues of the matrix. As the matrix A is a Markov matrix (refer
Equation 1), the following theorems for eigenvalues of a matrix can be
used.

Theorem 2 [22] For an n x n Markov matrix,

• 1 is an eigenvalue

• the other eigenvalues satisfy $|\lambda_i| \le 1$

Theorem 3 [22] The sum of the n eigenvalues equals the sum of the n
diagonal entries:

$$\lambda_1 + \cdots + \lambda_n = A_{11} + \cdots + A_{nn}$$

Using Theorem 2 we get

$$\lambda_{max} = 1$$

As the least eigenvalue $\lambda_{min}$ will always be less than or
equal to the average of the eigenvalues other than $\lambda_{max}$, we
get

$$\lambda_{min} \le \frac{1}{n - 1} \sum_{i=2}^{n} \lambda_i$$

where $\lambda_1 = \lambda_{max}$. Using Theorem 3,

$$\lambda_{min} \le \frac{1}{n - 1} \left( \sum_{i=1}^{n} A_{ii} - 1 \right) \tag{15}$$

Hence, the condition number satisfies

$$c = \frac{1}{\lambda_{min}} \ge \frac{n - 1}{\sum_{i=1}^{n} A_{ii} - 1} \tag{16}$$

Now, due to the privacy constraints on A given by Equation 2,

$$A_{ii} \le \gamma A_{ij} \quad \text{for any } j \ne i,$$

i.e.,

$$A_{ii} \le \gamma A_{i1}, \quad A_{ii} \le \gamma A_{i2}, \quad \ldots$$

Summing the above over all $j \ne i$,

$$(n - 1) A_{ii} \le \gamma \sum_{j \ne i} A_{ij} = \gamma (1 - A_{ii})$$

where the last step is due to the condition on A given by Equation 1
(together with the symmetry of A). Solving for $A_{ii}$, we get

$$A_{ii} \le \frac{\gamma}{\gamma + n - 1} \tag{17}$$

Using the above inequality in Equation 16, we get

$$c \ge \frac{n - 1}{\frac{n \gamma}{\gamma + n - 1} - 1} = \frac{\gamma + n - 1}{\gamma - 1} \tag{18}$$

Hence the minimum condition number for the symmetric perturbation
matrices under privacy constraints represented by $\gamma$ is
$\frac{\gamma + n - 1}{\gamma - 1}$. This condition number is achieved
when $A_{ii} = \frac{\gamma}{\gamma + n - 1}$.

The diagonal values of the gamma-diagonal matrix given by Equation 13
are $\frac{\gamma}{\gamma + n - 1}$. Thus it is a minimum condition
number symmetric perturbation matrix, with condition number
$\frac{\gamma + |S_U| - 1}{\gamma - 1}$.
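The properties claimed for the gamma-diagonal matrix can be checked numerically on a small instance. The sketch below assumes n = 6 and $\gamma$ = 19 purely for illustration, and uses the fact that the matrix equals $(\gamma - 1)x\,I + x\,J$ (J being the all-ones matrix), so its eigenvalues are 1 and $(\gamma - 1)x$.

```python
def gamma_diagonal(n, gamma):
    """n x n matrix with gamma*x on the diagonal and x elsewhere, x = 1/(gamma + n - 1)."""
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if i == j else x for j in range(n)] for i in range(n)]


def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(a * t for a, t in zip(row, v)) for row in A]


n, gamma = 6, 19.0             # illustrative domain size and amplification factor
A = gamma_diagonal(n, gamma)
x = 1.0 / (gamma + n - 1)

# Equation 1: every column sums to one.
column_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]

# The all-ones vector is an eigenvector with eigenvalue 1.
ones_image = matvec(A, [1.0] * n)

# Any difference of two unit vectors is an eigenvector with eigenvalue (gamma - 1) * x.
diff = [1.0, -1.0] + [0.0] * (n - 2)
diff_image = matvec(A, diff)

lambda_max, lambda_min = 1.0, (gamma - 1.0) * x
condition_number = lambda_max / lambda_min
print(condition_number, 1.0 + n / (gamma - 1.0))
```

The two printed values agree, matching the closed form $1 + |S_U| / (\gamma - 1)$ and the lower bound of Equation 18.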

4. Randomizing the Perturbation Matrix 

The estimation model in the previous section implic- 
itly assumed the perturbation matrix A to be deterministic. 
However, it appears intuitive that if the perturbation matrix 
parameters are themselves randomized, so that each client 
uses a perturbation matrix that is not specifically known to 
the miner, the privacy of the client will be further increased. 
Of course, it may also happen that the reconstruction accu- 
racy may suffer in this process. 

In this section, we explore this tradeoff. Instead of a deterministic
matrix A, the perturbation matrix here is a matrix $\tilde{A}$ of
random variables, where each entry $\tilde{A}_{vu}$ is a random
variable with $E(\tilde{A}_{vu}) = A_{vu}$. The values taken by the
random variables for a client $C_i$ provide the specific values for
his/her perturbation matrix.

4.1. Privacy Guarantees 

Let $Q(U_i)$ be a property of client $C_i$'s private information, and
let record $U_i = u$ be perturbed to $V_i = v$. Denote the prior
probability of $Q(U_i)$ by $P(Q(U_i))$. On seeing the perturbed data,
the posterior probability of the property is calculated to be:

$$P(Q(U_i) \mid V_i = v) = \sum_{Q(u)} P_{U_i | V_i}(u \mid v) = \sum_{Q(u)} \frac{P_{U_i}(u)\, P_{V_i | U_i}(v \mid u)}{P_{V_i}(v)}$$

When we use a fixed perturbation matrix A for all clients i, then
$P_{V_i | U_i}(v \mid u) = A_{vu},\ \forall i$. Hence

$$P(Q(U_i) \mid V_i = v) = \frac{\sum_{Q(u)} P_{U_i}(u) A_{vu}}{\sum_{Q(u)} P_{U_i}(u) A_{vu} + \sum_{\neg Q(u)} P_{U_i}(u) A_{vu}}$$

As discussed in [13], the data distribution $P_{U_i}$ in the worst case
can be such that $P(U_i = u) > 0$ only if

$$\{u \in I_U \mid Q(u) \text{ and } A_{vu} = maxp\} \quad \text{or} \quad \{u \in I_U \mid \neg Q(u) \text{ and } A_{vu} = minp\}$$

so that

$$P(Q(U_i) \mid V_i = v) = \frac{P(Q(u)) \cdot maxp}{P(Q(u)) \cdot maxp + P(\neg Q(u)) \cdot minp}$$

where $maxp = \max_{Q(u')} A_{vu'}$ and
$minp = \min_{\neg Q(u')} A_{vu'}$. Since the distribution $P_U$ is
known through reconstruction to the miner, and matrix A is fixed, the
above posterior probability can be determined by the miner. For
example, if $P(Q(u)) = 5\%$ and $\gamma = 19$, the posterior
probability can be computed to be 50% for perturbation with the
gamma-diagonal matrix.

But, in the randomized matrix case, where $P_{V_i | U_i}(v \mid u)$ is
a realization of the random variable $\tilde{A}_{vu}$, only its
distribution and not the exact value for a given i is known to the
miner. Thus determinations like the above cannot be made by the miner
for a given record $U_i$. For example, suppose we choose matrix
$\tilde{A}$ such that

$$\tilde{A}_{vu} = \begin{cases} \gamma x + r, & \text{if } u = v \\ x - \dfrac{r}{|S_U| - 1}, & \text{o.w.} \end{cases}$$

where $x = \frac{1}{\gamma + (|S_U| - 1)}$ and r is a random variable
uniformly distributed between $[-\alpha, \alpha]$. Thus, the worst case
posterior probability for a record $U_i$ is now a function of the value
of r, and is given by

$$\rho_2(r) = \frac{P(Q(u)) \cdot (\gamma x + r)}{P(Q(u)) \cdot (\gamma x + r) + P(\neg Q(u)) \left(x - \frac{r}{|S_U| - 1}\right)}$$

Therefore, only the posterior probability range, i.e.
$[\rho_2^-, \rho_2^+] = [\rho_2(-\alpha), \rho_2(+\alpha)]$, and the
distribution over the range, can be determined by the miner. For
example, for the situation $P(Q(u)) = 5\%$, $\gamma = 19$,
$\alpha = \gamma x / 2$, he can only say that the posterior probability
lies in the range [33%, 60%], with its probability of being greater
than 50% ($\rho_2$ corresponding to r = 0) equal to its probability of
being less than 50%.
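The posterior range quoted in this example can be reproduced with a few lines of arithmetic. The sketch below evaluates the worst-case posterior for the randomized gamma-diagonal matrix at r = 0 and at the two ends of the randomization interval; the domain size n = 2000 is an assumed value, chosen only because the quoted percentages correspond to a large $|S_U|$.

```python
def posterior(prior, gamma, n, r=0.0):
    """Worst-case posterior rho_2(r) for the randomized gamma-diagonal matrix."""
    x = 1.0 / (gamma + n - 1)
    diag = gamma * x + r                 # matrix entry when u = v
    off = x - r / (n - 1)                # matrix entry otherwise
    return prior * diag / (prior * diag + (1.0 - prior) * off)


prior, gamma, n = 0.05, 19.0, 2000       # n = |S_U| is an illustrative assumption
alpha = gamma / (gamma + n - 1) / 2.0    # alpha = gamma * x / 2

fixed = posterior(prior, gamma, n)               # deterministic case, r = 0
low = posterior(prior, gamma, n, -alpha)         # rho_2(-alpha)
high = posterior(prior, gamma, n, +alpha)        # rho_2(+alpha)
print(f"{fixed:.3f} {low:.3f} {high:.3f}")
```

For smaller domains the off-diagonal correction $r / (|S_U| - 1)$ is no longer negligible, so the endpoints shift slightly away from 33% and 60%.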

4.2. Reconstruction Model 

The reconstruction model for the deterministic perturbation matrix A
was discussed in Section 2.2. We now describe the changes to this
analysis for the randomized perturbation matrix $\tilde{A}$.

The probability of success for the Bernoulli variable $Y_v^i$ is now
modified to

$$P(Y_v^i = 1) = \tilde{A}_{vu}^{i}, \quad \text{for } U_i = u$$

where $\tilde{A}_{vu}^{i}$ denotes the i-th realization of random
variable $\tilde{A}_{vu}$.

Thus, from Equation 4,

$$E(Y_v) = \sum_{i=1}^{N} P(Y_v^i = 1) = \sum_{u \in I_U} \sum_{\{i \mid U_i = u\}} \tilde{A}_{vu}^{i} = \sum_{u \in I_U} \bar{A}_{vu} X_u \tag{19}$$

$$\Rightarrow E(Y) = \bar{A} X \tag{20}$$

where $\bar{A}_{vu} = \frac{1}{X_u} \sum_{\{i \mid U_i = u\}} \tilde{A}_{vu}^{i}$
is the average of the values taken by $\tilde{A}_{vu}$ for the clients
whose original data record had value u.

Since $\tilde{A}_{vu}$ is a random variable with expectation
$E(\tilde{A}_{vu}) = A_{vu}$, it can be easily seen that

$$E(\bar{A}_{vu}) = A_{vu} \tag{21}$$

Hence, from Equation 20, we get

$$E(E(Y)) = AX \tag{22}$$

We estimate X as $\hat{X}$ given by the solution of the following
equation

$$Y = A\hat{X} \tag{23}$$

which is an approximation to Equation 22. From Theorem 1 in Section
2.3, the error in estimation is bounded by:

$$\frac{\|\hat{X} - X\|}{\|X\|} \le c\, \frac{\|Y - E(E(Y))\|}{\|E(E(Y))\|} \tag{24}$$

where c is the condition number of perturbation matrix A. 

We now compare these bounds with the corresponding bounds of the
deterministic case. Firstly, note that, due to the use of the
randomized matrix, there is a double expectation for Y on the RHS of
the inequality, as opposed to the single expectation in the
deterministic case. Secondly, only the numerator is different between
the two cases since $E(E(Y)) = AX$. Now, we have

$$\|Y - E(E(Y))\| = \|(Y - E(Y)) + (E(Y) - E(E(Y)))\| \le \|Y - E(Y)\| + \|E(Y) - E(E(Y))\|$$

Here $\|Y - E(Y)\|$ is governed by the variance of random variable Y.
Since $Y_v$, as discussed before, is Poisson-binomial distributed, its
variance is given by [25]

$$Var(Y_v) = N \bar{p}_v - \sum_{i} (p_v^i)^2 \tag{25}$$

where $\bar{p}_v = \frac{1}{N} \sum_i p_v^i$ and
$p_v^i = P(Y_v^i = 1)$.

It is easily seen (by elementary calculus or induction) that among all
combinations $\{p_v^i\}$ such that $\sum_i p_v^i = N \bar{p}_v$, the
sum $\sum_i (p_v^i)^2$ assumes its minimum value when all $p_v^i$ are
equal. It follows that, if the average probability of success
$\bar{p}_v$ is kept constant, $Var(Y_v)$ assumes its maximum value when
$p_v^1 = \cdots = p_v^N$. In other words, the variability of $p_v^i$,
or its lack of uniformity, decreases the magnitude of chance
fluctuations, as measured by its variance [14]. On using the random
matrix $\tilde{A}$ instead of the deterministic A we increase the
variability of $p_v^i$ (now $p_v^i$ assumes variable values for all i),
hence decreasing the fluctuation of $Y_v$ from its expectation, as
measured by its variance.

Hence, $\|Y - E(Y)\|$ is likely to be decreased as compared to the
deterministic case, thereby reducing the error bound. On the other
hand, the positive value
$\|E(Y) - E(E(Y))\| = \|(\bar{A} - A)X\|$, which depends upon the
variance of the random variables in $\tilde{A}$, was 0 in the
deterministic case. Thus, the error bound is increased by this term.

So, we have a classic tradeoff situation here, and as shown later in
our experiments of Section 7, the tradeoff turns out very much in our
favour, with the two opposing terms almost canceling each other out,
making the error only marginally worse than the deterministic case.
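The variance argument used above, that spreading the success probabilities around a fixed average can only lower the variance of their sum, can be illustrated numerically. The probabilities below are arbitrary assumed values, and the variance is computed as N times the average probability minus the sum of squared probabilities.

```python
def poisson_binomial_variance(ps):
    """Variance of a sum of independent Bernoulli trials with success probabilities ps."""
    n = len(ps)
    p_bar = sum(ps) / n
    return n * p_bar - sum(p * p for p in ps)


N, p_bar = 1000, 0.3
equal = [p_bar] * N
# Same average, but half the trials are shifted up by 0.1 and half down by 0.1.
spread = [p_bar + 0.1] * (N // 2) + [p_bar - 0.1] * (N // 2)

var_equal = poisson_binomial_variance(equal)
var_spread = poisson_binomial_variance(spread)
print(var_equal, var_spread)
```

The non-uniform case has the smaller variance, which is the effect that partly offsets the extra error term introduced by randomizing the matrix.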

5. Implementation of Perturbation Algorithm 

To implement the perturbation process discussed in the previous
sections, we effectively need to generate, for each $U_i = u$, a
discrete distribution with PMF $P(v) = A_{vu}$ and CDF
$F(v) = \sum_{i \le v} A_{iu}$, defined over $v = 1, \ldots, |S_V|$.

A straightforward algorithm for generating the perturbed record v from
the original record u is the following:

1. Generate $r \sim U(0, 1)$

2. Repeat for $v = 1, \ldots, |S_V|$

   if $F(v - 1) < r \le F(v)$
   return $V_i = v$

where $U(0, 1)$ denotes the uniform distribution over the range [0, 1].
This algorithm, whose complexity is proportional to the product of the
cardinalities of the attribute domains, will require $|S_V| / 2$
iterations on average, which can turn out to be very large. For
example, with 31 attributes, each with two categories, this amounts to
$2^{30}$ iterations for each customer! We therefore present below an
alternative algorithm whose complexity is proportional to the sum of
the cardinalities of the attribute domains.
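The straightforward algorithm amounts to inverse-CDF sampling over a column of A. A minimal sketch follows, using 0-based values and an assumed 3 x 3 matrix; because Python's `random()` returns values in [0, 1), the comparison is written as r < F(v), which yields the same distribution.

```python
import random


def perturb_naive(u, A, rng):
    """Inverse-CDF sampling of v from column u of A; scans up to |S_V| values."""
    r = rng.random()
    cdf = 0.0
    for v in range(len(A)):
        cdf += A[v][u]          # running value of F(v) = sum of A[i][u] for i <= v
        if r < cdf:
            return v
    return len(A) - 1           # guard against floating-point round-off at the end


rng = random.Random(1)
A = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]           # hypothetical perturbation matrix

counts = [0, 0, 0]
for _ in range(20000):
    counts[perturb_naive(0, A, rng)] += 1
print([c / 20000 for c in counts])
```

The empirical frequencies approach column 0 of A, but each draw scans the whole record domain, which is what makes this approach impractical when $|S_V|$ is the product of many attribute domains.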

Given that we want to perturb the record $U_i = u$, we can write

$$P(V_i; U_i = u) = P(V_{i1}, \ldots, V_{iM}; u) = P(V_{i1}; u) \cdot P(V_{i2} \mid V_{i1}; u) \cdots P(V_{iM} \mid V_{i1}, \ldots, V_{i(M-1)}; u)$$

For the perturbation matrix A, we get the following expressions for the
above probabilities:

$$P(V_{i1} = a; u) = \sum_{\{v \mid v(1) = a\}} A_{vu}$$

$$P(V_{i2} = b \mid V_{i1} = a; u) = \frac{P(V_{i2} = b, V_{i1} = a; u)}{P(V_{i1} = a; u)} = \frac{\sum_{\{v \mid v(1) = a \text{ and } v(2) = b\}} A_{vu}}{P(V_{i1} = a; u)}$$

... and so on

where $v(i)$ denotes the value of the i-th column for record value v.

For the gamma-diagonal matrix A, and using $n_j$ to represent
$\prod_{k=1}^{j} |S_U^k|$, we get the following expressions for these
probabilities after some simple algebraic calculations:

$$P(V_{i1} = b; U_{i1} = b) = \left(\gamma + \frac{n_M}{n_1} - 1\right) x$$

$$P(V_{i1} = b; U_{i1} \ne b) = \frac{n_M}{n_1}\, x$$

Then, for the j-th attribute,

$$P(V_{ij} \mid V_{i1}, \ldots, V_{i(j-1)}; U_i) = \begin{cases} \dfrac{\left(\gamma + \frac{n_M}{n_j} - 1\right) x}{\prod_{k=1}^{j-1} p_k}, & \text{if } \forall k \le j,\ V_{ik} = U_{ik} \\[2ex] \dfrac{\frac{n_M}{n_j}\, x}{\prod_{k=1}^{j-1} p_k}, & \text{o.w.} \end{cases} \tag{26}$$

where $p_k$ is the probability that $V_{ik}$ takes value a, given that
a is the outcome of the random process performed for the k-th
attribute, i.e.

$$p_k = P(V_{ik} = a \mid V_{i1}, \ldots, V_{i(k-1)}; U_i)$$

Therefore, to achieve the desired random perturbation for a value in
column j, we use as input both its original value and the perturbed
value of the previous column j - 1, and generate the perturbed value as
per the discrete distribution given in Equation 26. Note that this is
an example of dependent column perturbation, in contrast to the
independent column perturbation used in most of the prior techniques.

To assess the complexity, it is easy to see that the average number of
iterations for the j-th discrete distribution will be $|S_U^j| / 2$,
and hence the average number of iterations for generating a perturbed
record will be $\sum_j |S_U^j| / 2$ (this value turns out to be exactly
M for a boolean database).

6. Application to Association Rule Mining

To illustrate the utility of the FRAPP framework, we demonstrate in
this section how it can be used for enhancing privacy-preserving mining
of association rules, a popular mining model that identifies
interesting correlations between database attributes [1, 21].

The core of association rule mining is to identify "frequent
itemsets", that is, all those itemsets whose support (i.e. frequency)
in the database is in excess of a user-specified threshold. Equation 8
can be directly used to estimate the support of itemsets containing all
M categorical attributes. However, in order to incorporate the
reconstruction procedure into bottom-up association rule mining
algorithms such as Apriori [2], we need to also be able to estimate the
supports of itemsets consisting of only a subset of attributes.

Let C denote the set of all attributes in the database, and $C_s$ be a
subset of attributes. Each of the attributes $j \in C_s$ can assume one
of the $|S_U^j|$ values. Thus, the number of itemsets over attributes
in $C_s$ is given by $n_{C_s} = \prod_{j \in C_s} |S_U^j|$. Let
$\mathcal{L}$, $\mathcal{H}$ denote itemsets over this subset of
attributes.

We say that a record supports an itemset $\mathcal{L}$ over $C_s$ if
the entries in the record for the attributes $j \in C_s$ are the same
as in $\mathcal{L}$.

Let the support of an itemset $\mathcal{L}$ in the original and
distorted databases be denoted by $sup^U_{\mathcal{L}}$ and
$sup^V_{\mathcal{L}}$, respectively. Then,

$$sup^V_{\mathcal{L}} = \frac{1}{N} \sum_{v \text{ supports } \mathcal{L}} Y_v$$

where $Y_v$ denotes the number of records in V with value v (refer
Section 2.2). From Equation 7 we know



(27) 



Hence, 



V 

supc 



1 

N 



E 

V supports C U 



N 



-E 



V supports £1 



N ^ E 

?^ U supports 'hi 



Xn, 



E 

V supports C 



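If for all $u$ which support a given itemset $\mathcal{H}$, 
$\sum_{v \text{ supports } \mathcal{L}} A_{vu} = \mathcal{A}_{\mathcal{HL}}$, i.e. it is equal for all $u$ which 
support a given itemset, then the above equation can be 
written as: 

$$sup^V_{\mathcal{L}} = \frac{1}{N} \sum_{\mathcal{H}} \mathcal{A}_{\mathcal{HL}} \sum_{u \text{ supports } \mathcal{H}} X_u = \sum_{\mathcal{H}} \mathcal{A}_{\mathcal{HL}} \, sup^U_{\mathcal{H}}$$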



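Now we find the matrix $\mathcal{A}$ for our gamma-diagonal matrix. 
Through some simple algebra, we get the following matrix 
$\mathcal{A}$ corresponding to itemsets over subset $\mathcal{C}_s$: 

$$
\mathcal{A}_{\mathcal{HL}} =
\begin{cases}
\gamma x + \left( \frac{|S_U|}{n_{\mathcal{C}_s}} - 1 \right) x & \text{if } \mathcal{H} = \mathcal{L} \\
\frac{|S_U|}{n_{\mathcal{C}_s}} \, x & \text{o.w.}
\end{cases}
\qquad (28)
$$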

Using the above nc^ x nc^ matrix we can estimate sup- 
port of itemsets over any subset Cs of attributes. Thus our 
scheme can be implemented on popular bottom-up associa- 
tion rule mining algorithms. 
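
The aggregation from the record-level gamma-diagonal matrix to the itemset-level matrix can be checked by brute force on a toy domain. The sketch below assumes the gamma-diagonal form (diagonal entries $\gamma x$, off-diagonal entries $x$, with $x = 1/(\gamma + |S_U| - 1)$) and the two-case form of Equation 28; the function names and the toy domain sizes are hypothetical. It uses exact rational arithmetic so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import product
from math import prod

def itemset_entries(gamma, domain_sizes, subset):
    """Diagonal and off-diagonal entries of the itemset-level matrix (Equation 28)."""
    n = prod(domain_sizes)                          # |S_U|, number of distinct records
    n_cs = prod(domain_sizes[j] for j in subset)    # number of itemsets over the subset
    x = Fraction(1, gamma + n - 1)                  # off-diagonal entry of the gamma-diagonal matrix
    off = Fraction(n, n_cs) * x
    diag = gamma * x + (Fraction(n, n_cs) - 1) * x
    return diag, off

def brute_force_entry(gamma, domain_sizes, subset, u, itemset):
    """Sum A[v][u] over every record v that supports `itemset` on `subset`."""
    x = Fraction(1, gamma + prod(domain_sizes) - 1)
    total = Fraction(0)
    for v in product(*(range(size) for size in domain_sizes)):
        if tuple(v[j] for j in subset) == itemset:
            total += gamma * x if v == u else x
    return total

def reconstruct_supports(perturbed, gamma, domain_sizes, subset):
    """Invert sup^V = A sup^U in closed form; assumes the original supports sum to 1."""
    diag, off = itemset_entries(gamma, domain_sizes, subset)
    return [(s - off) / (diag - off) for s in perturbed]

gamma, sizes, subset = 19, [2, 3, 2], (0, 2)
diag, off = itemset_entries(gamma, sizes, subset)
u = (1, 2, 0)   # this record supports itemset (1, 0) on attributes 0 and 2
print(brute_force_entry(gamma, sizes, subset, u, (1, 0)) == diag)   # True
print(brute_force_entry(gamma, sizes, subset, u, (0, 1)) == off)    # True
```

Because every record supports exactly one itemset over the chosen attributes, the original supports sum to 1, which is what allows the closed-form inversion in `reconstruct_supports` instead of a general linear solve.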

7. Performance Analysis 

We move on, in this section, to quantify the utility of the 
FRAPP framework with respect to the privacy and accuracy 
levels that it can provide for mining frequent itemsets. 

Datasets. We use the following real world datasets in our 
experiments: 

CENSUS : This dataset contains census information for 
approximately 50,000 adult American citizens. It is 
available from the UCI repository [26], and is a popular 
benchmark in data mining studies. It is also representative 
of a database where there are fields that users 
may prefer to keep private - for example, the "race" 
and "sex" attributes. We use three continuous (age, 
fnlwgt, hours-per-week) and three nominal 
attributes (native-country, sex, race) from 
the census database in our experiments. The continuous 
attributes are partitioned into (five) equiwidth intervals 
to convert them into categorical attributes. The 
categories used for each attribute are listed in Table 1. 

HEALTH : This dataset captures health information for 
over 100,000 patients collected by the US government 
[27]. We selected 3 continuous and 4 nominal attributes 
from the dataset for our experiments. The continuous 
attributes were partitioned into equi-width intervals 
to convert them into categorical attributes. The 
attributes and their categories are listed in Table 2. 

We evaluated the association rule mining accuracy of our 
schemes on the above datasets for $sup_{min} = 2\%$. Table 3 
gives the number of frequent itemsets in the datasets for 
$sup_{min} = 2\%$. 



Table 3. Frequent Itemsets for $sup_{min} = 0.02$ 

                      Itemset Length 
            1     2     3     4     5     6     7 
CENSUS     19   102   203   165    64    10     - 
HEALTH     23   123   292   361   250    86    12 


Privacy Metric. The $(\rho_1, \rho_2)$ strict privacy measure from 
[13] is used as the privacy metric. While we experimented 
with a variety of privacy settings, due to space limitations, 
we present results here for a sample $(\rho_1, \rho_2) = (5\%, 50\%)$, 
which was also used in [13]. This privacy value results in 
$\gamma = 19$. 
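
The value $\gamma = 19$ can be reproduced with a one-line computation. The sketch below assumes that $\gamma$ is obtained from the amplification bound $\gamma = \frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)}$ associated with the strict privacy measure of [13]; that relation is not restated in this section, so it should be read as an assumption, and the function name is hypothetical.

```python
from fractions import Fraction

def gamma_bound(rho1, rho2):
    """Amplification bound (assumed form): rho2 * (1 - rho1) / (rho1 * (1 - rho2))."""
    return (rho2 * (1 - rho1)) / (rho1 * (1 - rho2))

# (rho1, rho2) = (5%, 50%), the setting used in the experiments.
print(gamma_bound(Fraction(5, 100), Fraction(50, 100)))   # 19
```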

Accuracy Metrics. We evaluate two kinds of mining errors, 
Support Error and Identity Error, in our experiments: 

Support Error ($\rho$) This metric reflects the (percentage) 
average relative error in the reconstructed support values 
for those itemsets that are correctly identified to be 
frequent. Denoting the number of frequent itemsets by 
$|F|$, the reconstructed support by $\widehat{sup}$ and the actual 
support by $sup$, the support error is computed over all 
frequent itemsets as 

$$\rho = \frac{1}{|F|} \sum_{f \in F} \frac{|\widehat{sup}_f - sup_f|}{sup_f} * 100$$

Identity Error ($\sigma$) This metric reflects the percentage error 
in identifying frequent itemsets and has two components: 
$\sigma^+$, indicating the percentage of false positives, 
and $\sigma^-$, indicating the percentage of false negatives. 
Denoting the reconstructed set of frequent itemsets 
with $R$ and the correct set of frequent itemsets 
with $F$, these metrics are computed as: 

$$\sigma^+ = \frac{|R - F|}{|F|} * 100 \qquad \sigma^- = \frac{|F - R|}{|F|} * 100$$
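
The two accuracy metrics can be sketched as follows. The sketch assumes the forms $\rho = \frac{1}{|F|} \sum_{f \in F} \frac{|\widehat{sup}_f - sup_f|}{sup_f} * 100$, $\sigma^+ = \frac{|R - F|}{|F|} * 100$ and $\sigma^- = \frac{|F - R|}{|F|} * 100$, and that every itemset passed to `support_error` has a reconstructed support. The function names and the toy itemsets are hypothetical.

```python
def support_error(actual, reconstructed):
    """Average relative support error (in percent) over the itemsets in `actual`."""
    total = sum(abs(reconstructed[f] - actual[f]) / actual[f] for f in actual)
    return 100.0 * total / len(actual)

def identity_errors(reconstructed_set, correct_set):
    """Return (sigma_plus, sigma_minus): false positives and false negatives, as percent of |F|."""
    false_pos = len(reconstructed_set - correct_set)
    false_neg = len(correct_set - reconstructed_set)
    return 100.0 * false_pos / len(correct_set), 100.0 * false_neg / len(correct_set)

# Toy example: F is the correct set of frequent itemsets, R the reconstructed set.
F = {frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("c")}
R = {frozenset("a"), frozenset("b"), frozenset("d")}
print(identity_errors(R, F))   # (25.0, 50.0)
```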

Perturbation Mechanisms. We show frequent-itemset-mining 
accuracy results for our proposed perturbation 
mechanisms as well as representative prior techniques. For 
all the perturbation mechanisms, the mining from the distorted 
database was done using the Apriori [2] algorithm, with 
an additional support reconstruction phase at the end of each 
pass to recover the original supports from the perturbed 
database supports computed during the pass [18 .8 1. 

The perturbation mechanisms evaluated in our study are 
the following: 

DET-GD: This scheme uses the deterministic gamma-diagonal 
perturbation matrix A (Section 3) for perturbation 
and reconstruction. The implementation described 
in Section 5 was used to carry out the perturbation, 
and the results of Section 6 were used to compute 
the perturbation matrix used in each pass of Apriori for 
reconstruction. 

RAN-GD: This scheme uses the randomized gamma-diagonal 
perturbation matrix A (Section 4) for perturbation 
and reconstruction. Though in general any distribution 
can be used for A, here we evaluate the performance 
of uniformly distributed A given by Equation 19 
over the entire range of the randomization parameter 
$\alpha$. 

Table 1. CENSUS Dataset 

Attribute         Categories 
age               (15 - 35], (35 - 55], (55 - 75], > 75 
fnlwgt            (0 - 1e5], (1e5 - 2e5], (2e5 - 3e5], (3e5 - 4e5], > 4e5 
hours-per-week    (0 - 20], (20 - 40], (40 - 60], (60 - 80], > 80 
race              White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black 
sex               Female, Male 
native-country    United-States, Other 

Table 2. HEALTH Dataset 

Attribute                                 Categories 
AGE (Age)                                 [0 - 20), [20 - 40), [40 - 60), [60 - 80), > 80 
BDDAY12 (Bed days in past 12 months)      [0 - 7), [7 - 15), [15 - 30), [30 - 60), > 60 
DV12 (Doctor visits in past 12 months)    [0 - 7), [7 - 15), [15 - 30), [30 - 60), > 60 
PHONE (Has Telephone)                     Yes, phone number given; Yes, no phone number given; No 
SEX (Sex)                                 Male; Female 
INCFAM20 (Family Income)                  Less than $20,000; $20,000 or more 
HEALTH (Health status)                    Excellent; Very Good; Good; Fair; Poor 

MASK: This is the perturbation scheme proposed in [18], 
which is intended for boolean databases and is characterized 
by a single parameter $1 - p$, which determines 
the probability of an attribute value being flipped. In 
our scenario, the categorical attributes are mapped to 
boolean attributes by making each value of the category 
an attribute. Thus, the $M$ categorical attributes 
map to $M_b = \sum_j |S_U^j|$ boolean attributes. 

The flipping probability $1 - p$ was chosen as the lowest 
value which could satisfy the constraints given by 
Equation 2. The constraint $\forall v : \forall u_1, u_2 : \frac{A_{vu_1}}{A_{vu_2}} \leq \gamma$ 
is satisfied for MASK [18] if $\frac{p^{M_b}}{(1 - p)^{M_b}} \leq \gamma$. But, 
for each categorical attribute, one and only one of its 
associated boolean attributes takes value 1 in a particular 
record. Therefore, all the records contain exactly 
$M$ 1's, and the following condition is sufficient for the 
privacy constraints to be satisfied: 

$$\frac{p^{2M}}{(1 - p)^{2M}} \leq \gamma$$

This equation was used to determine the appropriate 
value of $p$. The value of $p$ turns out to be 0.5610 and 0.5524 
for the CENSUS and HEALTH datasets, respectively, for 
$\gamma = 19$. 



C&P: This is the Cut-and-Paste perturbation scheme proposed 
in [12], with algorithmic parameters K and p. 
To choose K, we varied K up to M, and for 
each K, p was chosen such that the matrix (Equation 12) 
satisfies the privacy constraints (Equation 2). 
The results reported here are for the (K, p) combination 
giving the best mining accuracy. For $\gamma = 19$, 
K = 3 and p = 0.494 turn out to be appropriate values. 
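
The MASK parameter values quoted above can be reproduced numerically. The sketch below assumes that the sufficient privacy condition has the form $\left(\frac{p}{1 - p}\right)^{2M} \leq \gamma$, with M = 6 attributes for CENSUS and M = 7 for HEALTH. Solving it with equality gives the lowest admissible flipping probability $1 - p$. The function name is hypothetical.

```python
def mask_retention_probability(gamma, num_attributes):
    """Largest p satisfying (p / (1 - p)) ** (2 * M) <= gamma (assumed form of the condition)."""
    ratio = gamma ** (1.0 / (2 * num_attributes))   # p / (1 - p) at equality
    return ratio / (1.0 + ratio)

for name, m in [("CENSUS", 6), ("HEALTH", 7)]:
    p = mask_retention_probability(19, m)
    print(f"{name}: p = {p:.4f}")
# CENSUS: p = 0.5610
# HEALTH: p = 0.5524
```

Under this assumed condition the computed values agree with the reported 0.5610 and 0.5524.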

7.1. Experimental Results 

For the CENSUS dataset, the support ($\rho$) and identity 
($\sigma^-$, $\sigma^+$) errors of the four perturbation mechanisms (DET-GD, 
RAN-GD, MASK, C&P) are shown in Figure 1 as a 
function of the length of the frequent itemsets. The corresponding 
graphs for the HEALTH dataset are shown in 
Figure 2. In these graphs, for comparison, the performance of 
RAN-GD is shown for randomization parameter $\alpha = \gamma x / 2$. 
Note that the support error ($\rho$) is plotted on a log-scale. 

In these figures, we first note that the performance of the 
DET-GD method is visibly better than that of MASK and 
C&P. In fact, as the length of the frequent itemset increases, 
the performance of both MASK and C&P degrades drasti- 
cally. MASK is not able to find any itemsets of length above 
4 for the CENSUS dataset, and above 5 for the HEALTH 
dataset, while C&P does not work beyond itemsets of length 3. 

The second point to note is that the accuracy of RAN-GD, 
although dealing with a randomized matrix, is only 
marginally lower than that of DET-GD. In return, it provides 
a substantial increase in privacy - its worst case 
(determinable) privacy breach is only 33% as compared to 
50% with DET-GD. Figure 3 shows the performance of RAN-GD 
over the entire range of $\alpha$, and the posterior probability 
range $[\rho^-, \rho^+]$. It shows mining support reconstruction errors 
for itemset length 4. We can observe that the performance 
of RAN-GD does not deviate much from the deterministic 
case over the entire range, whereas a very low determinable 
posterior probability can be obtained for higher 
values of $\alpha$. 

The primary reason for DET-GD and RAN-GD's good 
performance is the low condition number of their perturbation 
matrices. This is quantitatively shown in Figure 4, 
which compares the condition numbers (on a log-scale) of 
the reconstruction matrices. Note that as the expected value 
of the random matrix A is used for estimation in RAN-GD, 
and the random matrix used in the experiments has expected 
value A (refer Equation 19) used in DET-GD, the condition 
numbers for the two methods are equal. Here we see that the 
condition number for DET-GD and RAN-GD is not only 
low but also constant over all lengths of frequent itemsets 
(as mentioned before, the condition number is equal 
to $1 + \frac{|S_U|}{\gamma - 1}$). In marked contrast, the condition numbers 
for MASK and C&P increase exponentially with increasing 
itemset length, resulting in drastic degradation in accuracy. 
Thus our choice of a gamma-diagonal matrix shows highly 
promising results for discovery of long patterns. 
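
The constancy of the condition number can be illustrated on a toy domain. The sketch below assumes the itemset-level matrix has diagonal entries $\gamma x + \left(\frac{|S_U|}{n_{\mathcal{C}_s}} - 1\right) x$ and off-diagonal entries $\frac{|S_U|}{n_{\mathcal{C}_s}} x$, with $x = 1/(\gamma + |S_U| - 1)$. Such a matrix is symmetric with only two distinct eigenvalues, so its 2-norm condition number is their ratio. The function names and the toy domain size are hypothetical.

```python
from fractions import Fraction

def itemset_matrix(n, n_cs, gamma):
    """n_cs x n_cs itemset-level matrix for |S_U| = n; n_cs = n gives the full gamma-diagonal matrix."""
    x = Fraction(1, gamma + n - 1)
    off = Fraction(n, n_cs) * x
    diag = gamma * x + (Fraction(n, n_cs) - 1) * x
    return [[diag if i == j else off for j in range(n_cs)] for i in range(n_cs)]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def condition_number(matrix):
    """Ratio of the two eigenvalues of a matrix of the form a*I + b*J (a, b > 0, size >= 2)."""
    size = len(matrix)
    ones = [Fraction(1)] * size
    diff = [Fraction(1), Fraction(-1)] + [Fraction(0)] * (size - 2)
    lam_max = matvec(matrix, ones)[0]    # eigenvalue for the all-ones vector
    lam_min = matvec(matrix, diff)[0]    # eigenvalue for any vector orthogonal to it
    # Check that both vectors really are eigenvectors.
    assert matvec(matrix, ones) == [lam_max * v for v in ones]
    assert matvec(matrix, diff) == [lam_min * v for v in diff]
    return lam_max / lam_min

n, gamma = 12, 19    # toy domain with |S_U| = 12 records
for n_cs in (2, 4, 6, 12):
    print(n_cs, condition_number(itemset_matrix(n, n_cs, gamma)))
# every line reports 5/3, i.e. 1 + 12 / (19 - 1)
```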

8. Related Work 

The issue of maintaining privacy in data mining has at- 
tracted considerable attention in the recent past. 

The work closest to our approach is that of [12, 18, 13]. 
In the pioneering work of [3], privacy-preserving 
data classifiers based on adding noise to the record values 
were proposed. This work was extended in [7] and [16] to 
address a variety of subtle privacy loopholes. 

New randomization operators for maintaining data privacy 
for boolean data were presented and analyzed in 
[12, 18]. These methods are for categorical/boolean data 
and are based on probabilistic mapping from domain space 
to the range space rather than by incorporating additive 
noise to continuous valued data. A theoretical formulation 
of privacy breaches for such methods and a methodology 
for limiting them were given in the foundational work of 
[13]. 

Our work is directly related to the above-mentioned 
methodologies for privacy preserving mining. We combine 
the approaches for random perturbation on categorical data 
into a common theoretical framework, and explore how well 
random perturbation methods can do in the face of strict 
privacy requirements. We show that we can derive a pertur- 
bation matrix which performs significantly better than the 
existing methods for discovery of frequent itemsets in categorical 
data while simultaneously ensuring strict privacy 
guarantees. Also, we propose the novel idea of making the 
perturbation matrix itself random which, to the best of our 
knowledge, has not been previously explored in the context 
of privacy preserving mining. 

Another model of privacy-preserving data mining is the 
k-anonymity model [23]. The random perturbation model 
works under the strong privacy requirement that even the 
dataset-forming server is not allowed to learn or recover 
precise records. Users trust nobody and perturb their 
records at their end before providing them to any other 
party. The k-anonymity model [23] does not satisfy 
this requirement. The condensation approach discussed in 
[9] also requires relaxing the assumption that even 
the data-forming server is not allowed to learn or recover 
records, as in the k-anonymity model. Hence these models are 
orthogonal to our privacy model. 

l6l 131 [n] p I deal with Hippocratic databases which are 
database systems that take responsibility for the privacy 
of the data they manage. This involves specifying, in a 
privacy policy, how the data is to be used, and enforcing 
limited disclosure rules for regulatory concerns prompted by 
legislation. 

Finally, the problem addressed in I19lll0lfmi20l is how 
to prevent sensitive rules from being inferred by the data 
miner - this work is complementary to ours since it ad- 
dresses concerns about output privacy, whereas our focus 
is on the privacy of the input data. Maintaining input data 
privacy is considered in |I 24 | [Tsl ?, ?] in the context of 
databases that are distributed across a number of sites with 
each site only willing to share data mining results, but not 
the source data. 

9. Conclusions and Future Work 

In this paper, we developed FRAPP, a generalized model 
for random-perturbation-based methods operating on cate- 
gorical data under strict privacy constraints. We showed 
that by making careful choices of the model parameters 
and building perturbation methods for these choices, order- 
of-magnitude improvements in accuracy could be achieved 
as compared to the conventional approach of first deciding 
on a method and thereby implicitly fixing the associated 
model parameters. In particular, we proved that a "gamma- 
diagonal" perturbation matrix is capable of delivering the 
best accuracy, and is in fact, optimal with respect to its 
condition number. We presented an implementation tech- 
nique for gamma-diagonal-based perturbation, whose com- 
plexity is proportional to the sum of the domain cardinali- 
ties of the attributes in the database. Empirical evaluation 
of our new gamma-diagonal-based techniques on the CENSUS 
and HEALTH datasets showed substantial reductions 
in frequent itemset identity and support reconstruction errors. 

Figure 3. (a) Posterior probability ranges (b) Support error $\rho$ for CENSUS (c) Support error $\rho$ for HEALTH dataset with varying 
degree of randomization (x-axis: $\alpha/(\gamma x)$) 

Figure 4. Comparison of condition number of transition probability matrix (a) CENSUS (b) HEALTH 

We also investigated the novel strategy of having the perturbation 
matrix composed not of fixed values but of random 
variables. Our analysis of this approach showed that, 
at a marginal cost in accuracy, significant improvements in 
privacy levels could be achieved. 

In our future work, we plan to extend our modeling ap- 
proach to other flavors of mining tasks. 